As the information available to lay users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when tuples contain missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayes networks. Our approach involves mining ("learning") Bayes networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results with incompleteness on the query-constrained attributes when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayes networks provide significantly higher classification accuracy, and (ii) the relevant possible answers retrieved by queries reformulated using Bayes networks provide higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.
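
To make the idea of an incomplete tuple "defining a distribution over the complete tuples that it stands for" concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes a fixed, hand-picked network structure (`make -> model -> body`), a toy car table, and plain relative-frequency estimates; the attribute names, the data, and the helper functions `learn_cpts`, `joint_prob`, and `impute` are all hypothetical. A real system would also learn the structure from a database sample and smooth the estimates.

```python
from collections import Counter
from itertools import product

# Assumed (hypothetical) network structure: attribute -> tuple of parent attributes.
PARENTS = {"make": (), "model": ("make",), "body": ("model",)}

def learn_cpts(rows):
    """Estimate P(attr | parents) by relative frequency over complete tuples."""
    joint, context = Counter(), Counter()
    for row in rows:
        for attr, parents in PARENTS.items():
            ctx = tuple(row[par] for par in parents)
            joint[(attr, ctx, row[attr])] += 1
            context[(attr, ctx)] += 1
    return {key: n / context[(key[0], key[1])] for key, n in joint.items()}

def joint_prob(cpts, row):
    """Probability of a complete tuple as the product of its conditional terms."""
    prob = 1.0
    for attr, parents in PARENTS.items():
        ctx = tuple(row[par] for par in parents)
        prob *= cpts.get((attr, ctx, row[attr]), 0.0)  # unseen combination -> 0
    return prob

def impute(cpts, domains, row):
    """Return the most probable completion, filling all missing values jointly."""
    missing = [attr for attr in PARENTS if row[attr] is None]
    best, best_prob = None, -1.0
    for values in product(*(domains[attr] for attr in missing)):
        candidate = {**row, **dict(zip(missing, values))}
        prob = joint_prob(cpts, candidate)
        if prob > best_prob:
            best, best_prob = candidate, prob
    return best

# Toy sample of complete tuples (made-up data for illustration only).
rows = [
    {"make": "Honda", "model": "Civic", "body": "sedan"},
    {"make": "Honda", "model": "Civic", "body": "sedan"},
    {"make": "Honda", "model": "Civic", "body": "coupe"},
    {"make": "Honda", "model": "CR-V", "body": "suv"},
    {"make": "Ford", "model": "F-150", "body": "truck"},
    {"make": "Ford", "model": "F-150", "body": "truck"},
    {"make": "Ford", "model": "Focus", "body": "sedan"},
]
domains = {attr: sorted({r[attr] for r in rows}) for attr in PARENTS}
cpts = learn_cpts(rows)

# Two correlated attributes are missing at once; they are completed together.
completed = impute(cpts, domains, {"make": "Ford", "model": None, "body": None})
```

Because `model` and `body` are scored together through the joint probability, the completion cannot pair a model with a body style that never co-occurs with it in the sample, which is the kind of error an independence assumption over missing values can introduce.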